Spectral Coclustering to segment data science practitioners
This is a tutorial on using Spectral Coclustering to segment different groups of data science practitioners based on data from the Kaggle ML and DS survey. Coclustering groups together both rows and columns, and helps us focus on the features most pertinent to a given segment. We focus on applying Spectral Coclustering to the Kaggle Survey dataset and delve into some of the practical considerations. The code for the tutorial is available on GitHub.
- Clustering on Survey Data
- Selecting the number of clusters
- Survey Analysis
- Data science segments
- UMAP projection
- Conclusion
Clustering is a messy business. It is hard to find literature on best practices in a business context. How do you choose a clustering model, and a proper distance measure for a given dataset? How do you select the optimal number of clusters? In published research, scientists often apply clustering to datasets originally used for classification, and measure clustering performance against the ground truth labels. This gives us some insight about the applicability of the clustering method on the type of data used in the study, but it does little to show how clustering can be applied in a business setting, where there are typically no labels, and we need to rely heavily on domain knowledge.
In practice, clustering is most useful when we don't have labels and we are not even sure what the labels might be.
Clustering on Survey Data
Well-designed surveys can help us better understand our customers. Survey data can be particularly valuable if coupled with additional metrics on how the survey respondents use the product in question.
In this tutorial we apply Spectral Coclustering on the 2019 Kaggle Machine Learning and Data Science Survey. At the end of each year Kaggle sends out this survey to their (very large) user base, with the aim of capturing a snapshot of the state of ML and Data Science. We don't have access to data on how the survey respondents use Kaggle, so instead we focus on the survey data itself. We will use coclustering to find segments of data science practitioners, and interpret the factors unique to each segment.
Along the way we discuss a few technical topics in greater detail. In the next section, we cover the adjusted Rand index as a way of measuring the agreement between two data partitions (e.g. cluster solutions). In the last section, we discuss how Uniform Manifold Approximation and Projection (UMAP) can be applied to the survey dataset. Feel free to skip these sections if you are already familiar with these techniques.
Why Spectral Coclustering for survey data?
Spectral Coclustering clusters both rows and columns of a dataset, such that each row belongs to one cluster, and similarly, each column belongs to a single cluster. If we sort both the rows and columns based on the cluster labels, we obtain a block-diagonal structure. This structure can be clearly seen in this example where spectral coclustering was applied on a synthetic dataset. In part 2 of this tutorial (soon to come), we will dive deeper into the inner workings of the algorithm.
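To make the block-diagonal idea concrete, here is a minimal sketch (not the survey data itself) that plants biclusters in a synthetic matrix with scikit-learn's make_biclusters, fits SpectralCoclustering, and sorts rows and columns by cluster label:

```python
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# generate a 30x20 matrix with 3 planted biclusters, then shuffle it
data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3,
                                   noise=0.1, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)

# sorting rows and columns by their cluster labels reveals the block diagonal
sorted_data = data[np.argsort(model.row_labels_)]
sorted_data = sorted_data[:, np.argsort(model.column_labels_)]
```

Plotting sorted_data (e.g. with plt.matshow) shows the block-diagonal structure described above.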
Clustering both rows and columns is particularly useful with medium-to-large surveys that have a lot of multiple-choice questions, like the Kaggle survey. Below, we select 23 questions for the evaluation, which results in 231 unique responses (columns in the data table). Using standard clustering it is difficult to compare the different user-clusters across so many different columns. With Spectral Coclustering, we can focus on the responses that are most pertinent to a given cluster. Let's take a look at the toy example from the picture above. The corresponding table with responses is shown below.
#collapse
import json
from collections import defaultdict
import altair as alt
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
from umap import UMAP
import umap.plot
from sklearn.cluster import SpectralCoclustering
from sklearn.metrics.cluster import adjusted_mutual_info_score, adjusted_rand_score
alt.data_transformers.disable_max_rows();
Here each column corresponds to a response to a multiple-choice question from a survey, so the responses are encoded as binary features. For example, the first two columns represent responses to the question: "Select an activity that makes up an important part of your role at work." If we run biclustering on this dataset, we might obtain two clusters, color-coded orange and blue in the picture above. The blue cluster would contain rows (users) that tend to mark responses such as "build prototypes" (important activity at work), "Tensorflow" (what framework do you use?), and "CNNs" (what machine learning models do you use?). In this way, we segment the users and characterize them at the same time. Similarly, the orange cluster would be described by "analyze and understand data", "SQL" and "ggplot2". Of course, the separation is not perfect, and you can see "blue" users occasionally selecting "orange" responses and vice versa.
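If you are wondering how such a binary table is produced: in the raw Kaggle export, each checkbox lives in its own column, filled with the response text when selected and empty otherwise, so the encoding reduces to a not-null test. A minimal sketch (the column names and values here are hypothetical):

```python
import pandas as pd

# hypothetical raw responses: one column per checkbox, NaN = not selected
raw = pd.DataFrame({
    'Q9_Part_1': ['Build prototypes', None, 'Build prototypes'],
    'Q9_Part_2': [None, 'Analyze and understand data', None],
})

# each checkbox becomes a binary feature: 1 if selected, 0 otherwise
encoded = raw.notna().astype(int)
```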
The Kaggle Survey questions
One of the most important decisions when clustering survey responses is choosing the questions used for clustering, and the ones used for analysis and validation post-clustering. If we were doing this in a professional setting, this would be a good time to discuss the survey goals with our teammates.
Most of the questions in this survey ask about things data science practitioners do at work: the types of problems they solve, frameworks they use, models they build. So we will use the responses to these questions for clustering. The full list of questions, color-coded based on my selection, is available here.
The questions selected for analysis (post-clustering) relate to the users' prior experience (coding experience, formal education), as well as team size. This selection will allow us to make conclusions such as: Users from cluster X focus on building deep learning prototypes using Tensorflow. Most of them have at least 2 years of experience with coding for data science.
Question 5 ("select title most similar to your current role") is used to select an initial number of clusters - more on that below.
The respondents
Once we drop empty or nearly-empty responses, we are left with records from more than 18000 users. However, the survey was designed such that users who don't have a lot of programming experience for data analysis (Q15) were not shown many of the questions. This effectively creates two groups: one that did not see most of the questions, and one that did. For the rest of the tutorial we focus on the more-experienced group since it likely better represents data science practitioners (and they also filled out most of the questions, so we have more interesting data to work with). If you are curious about the preprocessing steps, you can find them in this notebook.
#collapse
df = pd.read_feather('data/encoded_subset.feather')
res = pd.read_feather('data/processed_subset.feather')
with open('data/responses.json', 'r') as f:
    uniques = json.load(f)
with open('data/short_questions.json', 'r') as f:
    short_qs = json.load(f)
qs = pd.read_csv('data/questions_only.csv').T.squeeze()
The questions used in clustering include single- or multi-selections, so each selection (column) can be encoded as binary.
df.head()
We also include mappings between question numbers and short summaries of the question, as well as the actual responses. These will help us interpret the clusters later.
short_qs['Q20'], uniques['Q20_Part_1'], uniques['Q20_Part_2']
Selecting the number of clusters
There are a few important considerations when selecting the number of clusters for this type of task:
- Most humans will struggle to keep track of more than about 10 user segments. Since the overall purpose of this analysis is to better understand our users, we will need to communicate our results to business leaders: we can anticipate that slides will be drawn, presentations will be given, and reports will be written. Limiting ourselves to a few segments will help us draw a clearer picture of the user-base, especially when getting started.
- When doing the analysis, it is easy to overcluster (pick more clusters than we expect the right number to be) and then manually merge similar clusters together. Going in the opposite direction is harder.
Keeping these issues in mind, we will vary the number of clusters and measure the agreement between the resultant cluster solution and the responses from Question 5 (job role). We use this question because we would expect different data science segments to have distinct distributions of job roles. Furthermore, this question involves a single selection, so we obtain a single label (job role) per user. Note that Q5 is not our gold standard in terms of a cluster solution, we merely use it as a rough guide.
In terms of the actual metric to measure the agreement between the job role and cluster label, we can use the adjusted Rand index (ARI). It is a useful measure as it is corrected against chance - we delve into the details in the next section.
What about cluster evaluation metrics that do not require ground truth labels? Why not use silhouette coefficient or the Calinski-Harabasz index? Personally, I have not found these particularly useful, especially since each one tends to favor a particular type of clustering. If you have successfully used any of these metrics, let me know in the comments!
Rand Index
The Adjusted Rand Index is based on the Rand Index, so we need to understand the latter first. The Rand Index is a standard measure of the similarity of two partitions (e.g. cluster solutions) of a dataset. Let's suppose we want to measure the agreement between the following partitions:
from itertools import combinations
p1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
p2 = [0, 1, 0, 1, 1, 1, 2, 0, 2]
The Rand Index is calculated by looking at all pairs of items and finding:
- a = the number of pairs that are in the same subset in p1 and also in the same subset in p2.
- b = the number of pairs that are in different subsets in p1 and also in different subsets in p2.
The following functions do this calculation.
def same_subset(items, part):
    """Given a pair of items, find out if they are in the same subset"""
    i, j = items
    return part[i] == part[j]
def set_pairs(part, same=True):
    """Given a partition, find the set of pairs that are in the
    same (same=True) or different (same=False) partitions"""
    if same:
        selection = lambda pair: same_subset(pair, part)
    else:
        selection = lambda pair: not same_subset(pair, part)
    n = len(part)
    return set(filter(selection, combinations(range(n), 2)))
For example, these are the set of pairs that are in the same subsets in p1:
set_pairs(p1, same=True)
Once we can find the appropriate pairs, we can easily calculate the Rand Index. It is simply given by:
$$RI = \frac{a + b}{n_{pairs}} = \frac{2 (a + b)}{n (n - 1)}$$
where $n$ is the total number of items in our dataset.
def rand_index(p1, p2):
    """Compute the Rand index (not adjusted)"""
    assert len(p1) == len(p2) and len(p1) > 2
    a = len(set_pairs(p1, True) & set_pairs(p2, True))
    b = len(set_pairs(p1, False) & set_pairs(p2, False))
    n_pairs = len(p1) * (len(p1) - 1) / 2
    return (a + b) / n_pairs
rand_index(p1, p2)
So this metric varies between 0 and 1, where 1 indicates a perfect agreement between the two partitions. An important property is that the Rand index is independent of the actual cluster labels:
p3 = [7, 7, 7, 8, 8, 8, 9, 9, 9]
rand_index(p3, p2)
def rand_sample(u1, u2, part_len, n_iter):
    """Measures the Rand index between random sequences n_iter times.
    The two sequences have u1 and u2 unique elements, and are both of len = part_len."""
    scores = np.zeros(n_iter)
    for i in range(n_iter):
        s1 = np.random.choice(u1, part_len)
        s2 = np.random.choice(u2, part_len)
        scores[i] = rand_index(s1, s2)
    return scores
scores = rand_sample(3, 3, part_len=9, n_iter=1000)
plt.hist(scores, bins=10);
plt.title('Rand scores (random sequences len = 9)');
We see that $RI$ values between 0.4 and 0.7 are quite common when dealing with such a short sequence. We can try the same with longer sequences, and this is what we obtain:
scores_long = rand_sample(3, 3, part_len=100, n_iter=1000)
plt.hist(scores_long, bins=10)
plt.title('Rand scores (random sequences len = 100)');
The score for the longer sequences is quite sharply peaked between 0.54 and 0.58, so it would be very unlikely to obtain a value of 0.75 by chance.
The adjusted Rand index then takes this random variation into account as follows:
$$ARI = \frac{RI - E_{random}[RI]}{Max[RI] - E_{random}[RI]}$$
$ARI$ is simply a rescaled version of $RI$, computed by subtracting the mean $RI$ we would obtain by chance with random sequences (e.g. $E_{random}[RI] = 0.55$ for the first example above), and by dividing this value by the difference between the maximum and the average $RI$. The division is done to increase the sensitivity of the metric. This means that $ARI$ might be negative, if $RI < E_{random}[RI]$.
$ARI$ can be calculated exactly using the contingency table calculated from the two partitions. The formula can be found here, and it is used in the scikit-learn implementation.
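To make the exact computation concrete, here is a small sketch of the contingency-table formula (the helper names c2 and ari_exact are mine, not from scikit-learn):

```python
import numpy as np

def c2(x):
    """Number of pairs C(x, 2), elementwise."""
    x = np.asarray(x, dtype=float)
    return x * (x - 1) / 2

def ari_exact(p1, p2):
    """Exact ARI via the contingency table between two label sequences."""
    _, inv1 = np.unique(p1, return_inverse=True)
    _, inv2 = np.unique(p2, return_inverse=True)
    cont = np.zeros((inv1.max() + 1, inv2.max() + 1), dtype=int)
    np.add.at(cont, (inv1, inv2), 1)     # counts of items with labels (i, j)
    index = c2(cont).sum()               # pairs grouped together in both
    sum_a = c2(cont.sum(axis=1)).sum()   # pairs grouped together in p1
    sum_b = c2(cont.sum(axis=0)).sum()   # pairs grouped together in p2
    expected = sum_a * sum_b / c2(len(p1))   # chance level
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

p1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
p2 = [0, 1, 0, 1, 1, 1, 2, 0, 2]
ari_exact(p1, p2)  # ≈ 0.357
```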
However, I find it easier to understand the $ARI$ correction using the rand_sample function we saw above. This will give us an approximation, and it will be slower to compute since we need to sample many times to simulate the $RI$ distribution. On the other hand, we can clearly see the link between the code below and the $ARI$ definition:
def adjusted_rand(p1, p2, n_iter=5000):
    u1, u2 = len(np.unique(p1)), len(np.unique(p2))
    mean_rand = np.mean(rand_sample(u1, u2, len(p1), n_iter))
    ri = rand_index(p1, p2)
    return (ri - mean_rand) / (1 - mean_rand)
In the above code, I have used the simplification $Max[RI] = 1$. This would not work if the two sequences have different numbers of clusters.
adjusted_rand(p1, p2)
Note that the $ARI$ is substantially lower for our short sequences compared with the $RI = 0.75$. We can also calculate it on longer sequences and compare against the scikit-learn implementation.
p4, p5 = np.random.choice(3, 100), np.random.choice(3, 100)
p4[:50] = p5[:50] # some overlap between the sequences
adjusted_rand(p4, p5, 1000)
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(p4, p5)
The values are off by about 0.005, quite close!
#collapse
def cocluster_iter(range_cl, data, **kwargs):
    """Generator that fits coclustering for a range of different clusters."""
    for n_clusters in tqdm(range_cl):
        bicl = SpectralCoclustering(n_clusters=n_clusters, **kwargs)
        bicl.fit(data)
        yield n_clusters, bicl

def cluster_metrics(cocluster, metric, target):
    """Compute a given metric for a range of different n_clusters
    cocluster: coclustering generator that yields a fitted cluster object
    metric: metric callable, needs to accept metric(reference_labels, predicted_labels)
    target: target variable against which to compute metrics.
    """
    metrics = defaultdict(list)
    for n_clusters, bicl in cocluster:
        metrics['metric'].append(metric(target, bicl.row_labels_))
        metrics['n_clusters'].append(n_clusters)
    return pd.Series(metrics['metric'], index=metrics['n_clusters'])
cocluster = cocluster_iter(range(2, 25), df, random_state=0)
metrics = cluster_metrics(cocluster, adjusted_rand_score, res['Q5'].cat.codes)
ax = metrics.plot()
ax.set_title('Cluster metric against Q5 (Occupation)');
There is a sudden spike at n_clusters=9, so we will use that as a starting point.
%%time
bicl = SpectralCoclustering(n_clusters=9, random_state=0)
bicl.fit(df)
print('Column (question-response) cluster counts:', np.unique(bicl.column_labels_, return_counts=True)[1])
print('Row (user) cluster counts:', np.unique(bicl.row_labels_, return_counts=True)[1])
We obtain relatively balanced clusters, where the smallest user cluster contains 123 users, while the largest contains 2703 users. No single cluster dominates the questions or the users.
First, we study the relationship between Q5 and the clusters. We use altair for plotting, because it allows us to easily add interactive elements such as tooltips to our plots. Here is a bar chart of the distribution of job roles per cluster. Since the actual cluster numbers are arbitrary, we sort the bars based on the proportion of students, since clusters with a large proportion of students should behave differently than those with few or no students. You can hover over the individual bars to get the actual user counts.
#collapse
# create an analysis table where some of the questions are ordered
row_labels = pd.Series(bicl.row_labels_, index=df.index, name='cluster')
analysis = pd.concat([res[['Q4', 'Q5', 'Q6', 'Q7', 'Q15']], row_labels], axis=1)
q15_map = {
    '< 1 years': '<= 2 years',
    '1-2 years': '<= 2 years',
    '3-5 years': '3-10 years',
    '5-10 years': '3-10 years',
    '10-20 years': '10+ years',
    '20+ years': '10+ years',
}
simple_cats = ['<= 2 years', '3-10 years', '10+ years']
analysis['Q15_simple'] = pd.Categorical(analysis['Q15'].map(q15_map), categories=simple_cats,
                                        ordered=True)
for name, col in analysis.items():
    if hasattr(col, 'cat') and col.cat.ordered:
        analysis[f'{name}_order'] = col.cat.codes + 1
#collapse
# sort clusters by proportion of students in the cluster
sort_order = (analysis
              .groupby('cluster')['Q5']
              .apply(lambda s: (s == 'Student').mean())
              .sort_values())
alt.Chart(data=analysis[['Q5', 'cluster']]).mark_bar(size=35).encode(
    x=alt.X('cluster:N', sort=sort_order.index.tolist()),
    y=alt.Y('count()', stack='normalize', title='Proportion (per cluster)'),
    color=alt.Color('Q5', scale=alt.Scale(scheme='tableau20')),
    tooltip=['Q5', 'count()']
).properties(
    width=500
)
We can see that the clusters on the left are dominated by Data Scientists, followed by Software Engineers, Data Analysts and Research Engineers. Besides Students, the clusters on the right have a higher proportion of Not employed respondents.
#collapse
row_idx, col_idx = np.argsort(bicl.row_labels_), np.argsort(bicl.column_labels_)
df_sorted = df.iloc[row_idx, col_idx]
rows_sorted = bicl.row_labels_[row_idx]
cols_sorted = bicl.column_labels_[col_idx]
Since some responses are inherently more popular than others, we are first going to subtract the average per column (computed across all users).
We can visualize the data matrix, after the subtraction, as follows:
#collapse
df_center = df_sorted - df_sorted.mean()
fig = plt.figure(figsize=(7, 5))
ax = plt.gca()
im = ax.imshow(df_center, aspect=0.015, cmap='PRGn')
cbar = fig.colorbar(im, ax=ax)
cbar.ax.set_ylabel('Subtracted value', rotation=-90, va="bottom")
fig.tight_layout()
This heatmap gives us a good global picture of the dataset. Both rows and columns are sorted based on cluster: for example, cluster 0 is at the top left. Most of the high values in green are along the block diagonal, as we would expect from the coclustering. However, there are a few green blocks away from the diagonal, in particular in the lower left corner. We will need to keep an eye on those especially when we analyze clusters 7 and 8, which are the ones with the largest off-diagonal blocks.
Next, we are going to display average values per cluster, so we can characterize each cluster. First, we are going to create better labels for our columns. Currently, each column looks like this: Q27_Part_4; it is time-consuming to look up the questions and responses. Instead, what we want is something more readable, such as: 'NLP methods: Transformer language models (GPT-2, BERT,' We have all the necessary information in the uniques and short_qs mappings.
#collapse
def col_label(col, n_words=5):
    """Make a human-friendly (but short) label for a given column.
    'Q26_Part_2' -> 'CV methods: Image segmentation methods (U-Net, Mask'
    """
    answer_words = uniques[col].split()
    answer_label = ' '.join(answer_words[:n_words])  # limit to n_words
    q_index = col.split('_')[0]  # 'Q26'
    q_label = short_qs[q_index]  # 'CV methods'
    return q_label + ': ' + answer_label
col_label('Q27_Part_3')
We can now compute averages per cluster (using the centered data), and add the labels we created.
#collapse
avg_cluster = df_center.groupby(rows_sorted).mean()
avg_cluster.columns = avg_cluster.columns.map(col_label)
avg_cluster.head(2)
From the above, we note that cluster 0 is much more likely to use Image classification than cluster 1, for example. Using df_center helps emphasize the differences between the clusters.
We are finally ready to summarize the clusters. Based on the heatmap above, we will focus mostly on the cluster-specific responses (columns) that run along the block diagonal: for a cluster of users c, these are the columns that belong to column cluster c. We will also keep an eye out for high values in the rest of the columns. Below is a simple function which prints a cluster summary, that is, the columns with the highest average (centered) score for a given cluster.
#collapse
def get_cluster_cols(c):
    """Split the average responses for a given cluster into two groups:
    specific: these are the responses (columns) specific to a given cluster
    others: the rest of the responses, that belong to a different cluster"""
    specific = avg_cluster.loc[c, cols_sorted == c].sort_values(ascending=False)
    others = avg_cluster.loc[c, cols_sorted != c].sort_values(ascending=False)
    return specific, others

def print_summary(c, n_specific=10, n_other=5):
    print('Cluster', c)
    specific, others = get_cluster_cols(c)
    print('Cluster-specific:')
    print(specific.head(n_specific).round(3).to_string())
    print()
    print('Others (off-diagonal):')
    print(others.head(n_other).round(3).to_string())
    print()
print_summary(0, n_specific=15)
Based on this, users in cluster 0 are very focused on deep learning and computer vision. Let's contrast them with users from cluster 7 - this is one of the clusters whose responses are overlapping with those of cluster 0.
We see that cluster 7 is also focused on deep learning, as well as experimentation, iteration, and research. When we go back to our job role bar-chart, we note that cluster 0 contains the highest proportion of students across all clusters, whereas cluster 7 is a mix of mostly data scientists, software engineers and research scientists.
By keeping track of the off-diagonal elements, we are able to better characterize the two clusters. We can go a little further by examining differences in responses to Q15 ("years of writing code for data analysis").
# collapse
def stacked_bar_cluster(data, q):
    """Stacked bar with order information.
    data: dataframe to plot, needs to contain the question, and cluster
    q: name of the question, e.g. 'Q6'
    """
    q_order = f'{q}_order'
    order = q_order if q_order in data.columns else []
    return alt.Chart(data=data).mark_bar(size=25).encode(
        x=alt.X('cluster:N', sort=sort_order.index.tolist()),
        y=alt.Y('count()', title='Proportion (per cluster)', stack='normalize'),
        # need to provide a list with ordered categories to display correctly
        color=alt.Color(f'{q}:O', scale=alt.Scale(scheme='inferno'),
                        sort=list(data[q].cat.categories)),
        tooltip=[q, 'count()'],
        # force an order on a categorical variable
        order=order
    ).properties(
        width=400,
        height=280,
    )
stacked_bar_cluster(analysis, 'Q15_simple').properties(title='Years writing code (data)')
The majority of cluster 0 users (65%) are fairly new to programming for data analysis (<= 2 years). In the case of cluster 7, this proportion is much smaller, at about 35%. In summary, there is some overlap in terms of the technologies that the two clusters are using but their experience and job roles are substantially different.
#collapse
stacked_bar_cluster(analysis, 'Q6').properties(title='Size of company')
Students were not asked about the size of the company, which is why we see the null values.
Data science segments
We can use this procedure (examine highest responses per cluster, cross-reference with the analysis questions) to characterize the rest of the clusters. I came up with four groups of clusters, and it took me about 30 minutes once I had the code above ready.
A note about the notation
Below I use a $\Delta$ when I quote the centered scores: for example, for cluster 0, the centered score for "using Keras" is $\Delta = 0.31$, and the actual average score is $0.76$. This means that 76% of users in cluster 0 have selected Keras, and this is a lot higher ($\Delta = 0.31$) when compared to the overall population.
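The relationship between the two numbers is easy to verify on a toy example (the data below is made up, not from the survey): the raw per-cluster proportion equals the overall mean plus $\Delta$.

```python
import pandas as pd

# hypothetical mini-example: 1 = the user selected "Keras", one row per user
keras = pd.Series([1, 1, 1, 0, 1, 0, 0, 0])
cluster = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

overall = keras.mean()                       # proportion across all users
per_cluster = keras.groupby(cluster).mean()  # raw proportion per cluster
delta = per_cluster - overall                # the centered score (the Δ above)
```

Here cluster 0 has a raw score of 0.75 and a centered score of $\Delta = 0.25$ relative to the overall mean of 0.5.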
#collapse
user_segments = {
    'learners': [0, 6],
    'beginners': [3, 4, 5],
    'R users': [2],
    'professionals': [1, 7, 8],
}
cluster_size = row_labels.value_counts(sort=False)
fig = plt.figure(figsize=(8, 5))
ax = fig.gca()
for i, (segment, clusters) in enumerate(user_segments.items()):
    ax.bar(clusters, cluster_size.loc[clusters], label=segment,
           color=matplotlib.cm.Pastel1(i))
ax.legend()
ax.set_xticks(range(bicl.n_clusters));
ax.set_xlabel('clusters', fontsize=12)
ax.set_ylabel('number of respondents', fontsize=12);
Learners: Cluster 0, 6
These two clusters have the highest proportion of students (at about 50%). These are also the groups with the least experience in terms of years of programming (<= 2 years for the majority of users in both clusters). Both groups are active users of Kaggle Notebooks and Courses.
Cluster 0 users tend to focus on deep learning, especially on computer vision tasks (image classification: $\Delta = 0.49$, image segmentation: $\Delta = 0.26$). They use Keras ($0.72$) and Tensorflow ($0.70$) for building models.
Respondents from Cluster 6 use more traditional Python data analysis tools, such as Matplotlib ($\Delta = 0.15$), Seaborn ($\Delta=0.15$) and Scikit-learn ($\Delta = 0.12$).
Beginners: Clusters 3, 4, 5
These are the smallest clusters and together account for about 11% of the respondents. These users respond to most of the questions with None. Excel (or other related products) is the main software tool that shows up in all three clusters, especially in cluster 5 with a score of $\Delta = 0.40$. The most common job roles are: Business/Data Analyst, Software Engineer, Student.
Depending on the business purpose, it might be reasonable to merge these clusters into a single group. There are a few differences, however. For example, users from cluster 4 have longer coding experience (more than half of them have 3+ years of coding experience), and there are almost no students in that cluster.
R Users: Cluster 2
This is the R cluster! Examining the highest responses makes this clear: R as language regularly used ($\Delta = 0.58$), RStudio as the IDE ($\Delta = 0.57$), ggplot2 for visualization ($\Delta = 0.53$), and again, R, as the language to recommend ($\Delta = 0.27$). In terms of ML algorithms, these users are more likely to fit linear and logistic regression models ($\Delta = 0.16$).
Here, we see a split of the job role distribution between students, data scientists and data analysts. In terms of coding experience, this cluster is in-between the beginner clusters above and the professional clusters we will examine below. This cluster captures about 8% of the users.
Professionals: Clusters 1, 7, 8
These are the most experienced groups of users. Each of these clusters has unique characteristics, but we first review some of the similarities. These are the users with the most coding experience (with more than 60% of users in each group having 3+ years of experience). There are no students in these clusters, and conversely, these are the clusters with the largest proportions of data scientists (almost half of the users in cluster 1). Now, onto the unique aspects of each cluster:
Cluster 7 ML Researchers
We mentioned cluster 7 earlier when we contrasted it with cluster 0. Cluster 7 users focus on experimentation and iteration to improve ML models ($\Delta = 0.28$), and do research to advance the state of the art in ML ($\Delta = 0.18$). Curiously, these are also the users that are most likely to have an employer with a mature ML ecosystem ($\Delta = 0.16$). The research focus is deep learning, and captures almost all domains, libraries, and models of deep learning. On the other hand, these users are not likely to use cloud products (cloud products - None: $\Delta = 0.27$) and big data products (big data products - None: $\Delta = 0.45$). Perhaps this is because these users are building custom models and products that are not well-supported by mainstream cloud services.
Cluster 1 Analysts and prototype builders
Cluster 1 users focus on analyzing data to influence product decisions ($\Delta = 0.28$) and building prototypes to explore ML applications ($\Delta = 0.25$). There is some overlap with cluster 7, but it appears that cluster 1 users are focusing on tabular data: they are more likely to use SQL ($\Delta = 0.17$), as well as all major SQL databases. They use algorithms traditionally applied to tabular data such as decision trees and random forests ($\Delta = 0.20$), and gradient boosting ($\Delta = 0.16$). They are also more likely to use R regularly ($\Delta = 0.24$).
More than half of the users in cluster 1 work for a large company (1000+ employees), which is the highest proportion of any cluster.
Cluster 8 Deployment and Cloud
Users in cluster 8 focus on putting models into production using the cloud. Virtually all cloud-related platforms, services and products fall under this cluster. These users are also more likely to use AutoML products (Auto-Sklearn: $\Delta = 0.25$ and Google AutoML: $\Delta = 0.24$). These users focus on building the necessary data infrastructure ($\Delta = 0.22$) as well as analyzing data to influence product decisions ($\Delta = 0.21$).
Interestingly, Google Cloud is more popular with these users ($0.64$) compared to Amazon Web Services ($0.57$), and Microsoft Azure ($0.38$), which is very different from the market share of each of these cloud providers (AWS is well ahead, followed by Microsoft and Google). The over-representation of Google Cloud in this survey might be due to (1) Google Cloud's strong offerings in the machine learning space and (2) the fact that Kaggle is part of Google and it promotes some Google products.
We should also point out that this is the group with the highest response rates across all clusters (note all the green values for the last row-block in the heatmap).
UMAP projection
We are going to conclude our analysis by creating a UMAP projection of our dataset on the 2D plane. UMAP is a very useful nonlinear dimensionality reduction technique and deserves a tutorial of its own. We are not going to delve into the details here, and simply use it with these goals in mind:
- Assess the agreement between clustering labels and UMAP projection. If the two agree with each other, we can be more confident in our analysis.
- UMAP projections, especially plots of the underlying connectivity matrix, often look (to use the technical term) pretty cool. Plots that are pretty cool often attract attention: they can be included in title slides or document highlights and help get more people excited about our data analysis.
Jaccard coefficient as a distance measure
Perhaps the most important UMAP parameter is the distance measure. Our dataset includes binary features only (yes / no selections) so the Jaccard coefficient is a good choice. It is defined as the size of the intersection of two sets $u$ and $v$, divided by the size of their union:
$$J(u, v) = \frac{|u \cap v|}{|u \cup v |}$$
Wikipedia has a nice graphic of the Jaccard coefficient. In our case, the numerator will count the number of matches (shared selections) between users $u$ and $v$. The denominator will normalize this count by the total number of unique selections of both $u$ and $v$. It is easier to have more matches with a user that has made a lot of positive selections, so we need the denominator to control for this effect.
Let's compute the Jaccard index for a few of the users in the toy dataset to gain a better intuition. Here is the dataset again
The Jaccard index for users 0 and 1 is: $$J(u_0, u_1) = \frac{1}{6} = 0.167$$ because they have a single match (Analyze / understand data), and a total of 6 unique selections. On the other hand: $$J(u_0, u_2) = \frac{2}{4} = 0.50$$ Note that both the numerator and the denominator changed because the intersection of $u_0$ and $u_2$ is larger, while the union is smaller.
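We can reproduce these two numbers with a few hypothetical binary rows that match the counts above (the exact columns don't matter, only the intersection and union sizes):

```python
import numpy as np

def jaccard_sim(u, v):
    """Jaccard similarity between two binary selection vectors."""
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    return (u & v).sum() / (u | v).sum()

# hypothetical binary rows consistent with the worked example
u0 = [1, 1, 1, 0, 0, 0, 0]  # 3 selections
u1 = [1, 0, 0, 1, 1, 1, 0]  # 4 selections, 1 shared with u0
u2 = [1, 1, 0, 0, 0, 0, 1]  # 3 selections, 2 shared with u0

jaccard_sim(u0, u1)  # 1/6 ≈ 0.167
jaccard_sim(u0, u2)  # 0.5
```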
The Jaccard index appears in other places in machine learning; for example it is often used for the evaluation of image segmentation models. It is also closely related to cosine similarity, which works with continuous features, in addition to binary label sets. Indeed, using metric='cosine' in the projection below results in a very similar projection.
#collapse
mapper = UMAP(n_neighbors=15, min_dist=0.1,
              metric='jaccard', random_state=0).fit(df)
First, we are going to plot the projection, colored by the cluster labels.
#collapse
umap.plot.points(mapper, labels=row_labels, color_key_cmap='tab10',
                 width=700, height=600);
Overall, there is a good agreement between the cluster labels and the projection. Here are some of my observations:
- There are two main structures in the projection (two wings of a butterfly?), with a clear separation between them. One includes mostly clusters with students (0, 2, 3, 5, 6) and the other one includes clusters without.
- Learners (clusters 0 and 6) are neighboring each other in the projection, and so are the Professionals (1, 7 and 8).
- The Beginners (3, 4, 5) are split into two groups, with clusters 3 and most of 5 on one side.
#collapse
umap.plot.connectivity(mapper, edge_cmap='magma', background='black',
                       width=700, height=600);
Indeed, the connectivity matrix looks cool! It will make a fine presentation highlight.
In addition to the dense local connections, there are many connections between the two large structures that run in parallel to each other. This provides some additional information about the structure of the data which is lost in the 2D scatter plot. These parallel connections are encouraging because it appears that some of them link cluster 4 with 3 and 5 (near the top of the structure) suggesting that there is indeed a similarity between these, as noted previously.
Conclusion
This was a long journey! So what did we discover along the way?
- We found 6 segments of data science practitioners (after merging together the beginners group) based on 9 clusters. The segments are interpretable, and can potentially impact product offerings, especially if we could do additional analysis on how the different segments use the platform.
- We learned that the Rand Index is an intuitive measure to calculate cluster agreement, but it needs to be adjusted for chance (ARI to the rescue!).
- Finally, the UMAP projection (using the Jaccard index) showed some agreement with the clustering solution, and also produced a pretty cool connectivity plot.
Of course, we must always be cautious when we analyze survey results, and be aware of the different sources of bias that might appear. This article lists major sources of bias in surveys.
I hope you found this tutorial useful! If you have any feedback, let me know in the comments below!